Then the gradients w.r.t. $\beta$ can be similarly calculated as:
$$
\frac{\partial X_B^i}{\partial \beta} \overset{\mathrm{STE}}{\approx} \alpha \, \frac{\partial \, \mathrm{Clip}\!\left(\frac{X_R^i - \beta}{\alpha},\, 0,\, 1\right)}{\partial \beta} =
\begin{cases}
-1, & \text{if } \beta \le X_R^i < \alpha + \beta \\
0, & \text{otherwise}
\end{cases}
\tag{5.43}
$$
For the layers that contain both positive and negative real-valued activations, i.e., $X_R \in \mathbb{R}^n$, the binarized values $\hat{X}_B \in \{-1, 1\}^n$ are indifferent to the scale inside the Sign function: $X_B^i = \alpha \cdot \mathrm{Sign}\!\left(\frac{X_R^i - \beta}{\alpha}\right) = \alpha \cdot \mathrm{Sign}(X_R^i - \beta)$. In that case, since the effect of the scaling factor $\alpha$ inside the Sign function can be ignored, the gradient w.r.t. $\alpha$ can simply be calculated as $\frac{\partial X_B^i}{\partial \alpha} = \mathrm{Sign}(X_R^i - \beta)$.
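To make these straight-through estimators concrete, the following is a minimal PyTorch sketch of the elastic binarization for non-negative activations, assuming a forward pass of the form $\alpha \cdot \mathrm{round}(\mathrm{Clip}((X_R - \beta)/\alpha, 0, 1))$ with $\alpha$ and $\beta$ as learnable scalars. The $\beta$ gradient implements Eq. (5.43); the input and $\alpha$ gradients use the analogous STE expressions and, like the class name `ElasticBinarization`, are assumptions of this sketch rather than formulas quoted above.

```python
import torch

class ElasticBinarization(torch.autograd.Function):
    """Sketch: X_B = alpha * round(Clip((X_R - beta) / alpha, 0, 1)) for
    non-negative activations, with straight-through gradients."""

    @staticmethod
    def forward(ctx, x, alpha, beta):
        ctx.save_for_backward(x, alpha, beta)
        scaled = (x - beta) / alpha
        return alpha * torch.round(torch.clamp(scaled, 0.0, 1.0))

    @staticmethod
    def backward(ctx, grad_out):
        x, alpha, beta = ctx.saved_tensors
        scaled = (x - beta) / alpha
        inside = (scaled >= 0) & (scaled < 1)        # beta <= X_R < alpha + beta
        mask = inside.to(grad_out.dtype)

        # STE w.r.t. the input: pass gradients through inside the clip range.
        grad_x = grad_out * mask

        # Eq. (5.43): dX_B / dbeta = -1 inside the clip range, 0 otherwise.
        grad_beta = (-grad_out * mask).sum().reshape(beta.shape)

        # Analogous STE term for alpha (assumption, not quoted from the text):
        # round(Clip(s)) - s inside the range, 1 above it, 0 below.
        d_alpha = torch.where(inside,
                              torch.round(torch.clamp(scaled, 0.0, 1.0)) - scaled,
                              (scaled >= 1).to(grad_out.dtype))
        grad_alpha = (grad_out * d_alpha).sum().reshape(alpha.shape)

        return grad_x, grad_alpha, grad_beta

# Usage sketch with scalar learnable parameters:
alpha = torch.nn.Parameter(torch.tensor(1.0))
beta = torch.nn.Parameter(torch.tensor(0.0))
x = torch.randn(4, 8)
x_b = ElasticBinarization.apply(x, alpha, beta)
```

For the signed layers discussed above, the forward would instead be $\alpha \cdot \mathrm{Sign}(X_R - \beta)$ and the gradient w.r.t. $\alpha$ reduces to $\mathrm{Sign}(X_R - \beta)$.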
5.10.3 Multi-Distilled Binary BERT
Classical knowledge distillation (KD) [87] trains a student network so that its outputs (i.e., logits) are close to those of a teacher, which is typically larger and more complex. This approach is quite general and works with any student-teacher pair that shares the same output space. However, in practice, knowledge transfer happens faster and more effectively if the intermediate representations are also distilled [1]. This approach has proven useful when distilling into student models with a similar architecture [206], particularly for quantization [6, 116].
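As a rough illustration of such a distillation objective, the sketch below combines a logit-matching term with an MSE term on intermediate representations. The weighting `lambda_hidden`, the temperature, and the assumption that teacher and student expose hidden states of matching shapes are illustrative choices, not details taken from the papers cited above.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=1.0, lambda_hidden=1.0):
    """Logit distillation (soft targets) plus an MSE term on intermediate
    representations; a sketch, not the exact objective used by the authors."""
    # Soft targets from the teacher; KL divergence against the student.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    logit_loss = F.kl_div(log_student, soft_teacher,
                          reduction="batchmean") * temperature ** 2

    # Match intermediate (e.g., per-layer transformer) representations.
    hidden_loss = sum(F.mse_loss(s, t)
                      for s, t in zip(student_hidden, teacher_hidden))

    return logit_loss + lambda_hidden * hidden_loss
```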
Note that having a similar student-teacher pair is a requirement for distilling representations. While how similar they need to be is an open question, intuitively a teacher that is architecturally closer to the student should make the transfer of internal representations easier. In the context of quantization, it is easy to see that lower-precision students are progressively less similar to the full-precision teacher, which is one reason why binarization is difficult.
This suggests a multi-step approach, where instead of directly distilling from a full-
precision teacher to the desired quantization level, the authors first distilled into a model
with sufficient precision to preserve quality. This model can then be used as a teacher to
distill into a further quantized student. This process can be repeated multiple times, while
at each step ensuring that the teacher and student models are sufficiently similar, and the
performance loss is limited.
The multi-step distillation follows a quantization schedule $Q = \{(b_w^1, b_a^1), (b_w^2, b_a^2), \ldots, (b_w^k, b_a^k)\}$ with $(b_w^1, b_a^1) > (b_w^2, b_a^2) > \ldots > (b_w^k, b_a^k)$,¹ where $(b_w^k, b_a^k)$ is the target quantization level. In practice, the authors found that, down to a quantization level of W1A2, one can distill models of reasonable accuracy in a single shot. As a result, they followed the fixed quantization schedule W32A32 → W1A2 → W1A1.
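A minimal sketch of this multi-step procedure, mirroring the fixed W32A32 → W1A2 → W1A1 schedule, might look as follows; `quantize_model` and `distill` are hypothetical helpers supplied by the caller (standing in for quantization-aware model construction and the distillation loop), not a real API.

```python
from typing import Callable, List, Tuple

def multi_distill(full_precision_model,
                  schedule: List[Tuple[int, int]],
                  quantize_model: Callable,  # hypothetical: builds a (w_bits, a_bits) student from a teacher
                  distill: Callable,         # hypothetical: runs the distillation loop
                  train_data):
    """Distill through a sequence of progressively lower-precision students;
    at each step the previous student becomes the new, more similar teacher."""
    teacher = full_precision_model           # the W32A32 starting point
    for w_bits, a_bits in schedule[1:]:      # skip the full-precision entry
        student = quantize_model(teacher, weight_bits=w_bits, act_bits=a_bits)
        student = distill(teacher=teacher, student=student, data=train_data)
        teacher = student                    # reuse as teacher for the next step
    return teacher

# Fixed schedule used in the text: (weight bits, activation bits).
schedule = [(32, 32), (1, 2), (1, 1)]        # W32A32 -> W1A2 -> W1A1
```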
BiT, shown in Fig. 5.16, combines the elastic binary activations with multi-distillation; in doing so, it simultaneously ensures a good initialization for the eventual student model. Since the binary loss landscape is highly irregular, good initialization is critical to aid optimization.
In summary, this paper's contributions are: (1) the first demonstration of fully binary pre-trained BERT models with limited performance degradation; (2) a two-set binarization scheme, an elastic binary activation function with learned parameters, and a multi-distillation method that together boost the performance of binarized BERT models.
¹ $(a, b) > (c, d)$ if $a > c$ and $b \ge d$, or $a \ge c$ and $b > d$.